Suffix Tree Based Chinese Document Feature Extraction and Clustering in RSS Aggregator

Authors

  • Jian Wan
  • Wenming Yu
  • Xianghua Xu
Abstract

In an RSS aggregator, a key issue is how to make feed information more manageable for subscribers. In this paper, we propose a suffix tree based method for clustering RSS feed documents in a Chinese RSS aggregator. We construct a suffix tree from meaningful Chinese words and select the phrases that score highly under a scoring formula as document features. We then cluster the documents using the group-average algorithm with a new document similarity measure. Experimental results show that the new method improves clustering quality in the document “snippets” scenario and that its speed meets the demands of “on the fly” clustering.
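The pipeline outlined in the abstract (phrase features, then group-average clustering) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: documents are assumed to be already segmented into Chinese words, a naive enumeration of word n-grams stands in for the suffix tree, "shared by at least two documents" stands in for the paper's scoring formula, and Jaccard overlap stands in for the paper's new similarity measure (neither the formula nor the measure is given in the abstract). All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_phrases(docs, max_len=3):
    """Map each word n-gram (up to max_len words) to the documents containing it.

    A suffix tree exposes the same information more compactly; enumerating
    the prefixes of every word-level suffix is a naive stand-in.
    """
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                index[tuple(words[start:end])].add(doc_id)
    # Assumed selection rule: keep phrases shared by at least two documents.
    return {p: ids for p, ids in index.items() if len(ids) >= 2}


def doc_features(docs, max_len=3):
    """Return, for each document, the set of selected phrases it contains."""
    feats = [set() for _ in docs]
    for phrase, ids in shared_phrases(docs, max_len).items():
        for doc_id in ids:
            feats[doc_id].add(phrase)
    return feats


def jaccard(a, b):
    """Assumed similarity: overlap of two phrase sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_average_clustering(feats, threshold):
    """Merge the two clusters with the highest average pairwise similarity
    until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(feats))]

    def avg_sim(c1, c2):
        total = sum(jaccard(feats[i], feats[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_sim(clusters[pair[0]], clusters[pair[1]]),
        )
        if avg_sim(clusters[a], clusters[b]) < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters
```

Calling `group_average_clustering(doc_features(docs), threshold)` on a list of word lists returns clusters as lists of document indices. The threshold is a tuning parameter introduced here for illustration only.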

Similar resources

A new keyphrases extraction method based on suffix tree data structure for arabic documents clustering

Document clustering is a branch of a larger area of scientific study known as data mining. It is an unsupervised classification technique used to find structure in a collection of unlabeled data. The useful information in the documents can be accompanied by a large number of noise words when using full-text representation, which negatively affects the result of the clustering process. S...

A Novel Weighted Phrase-Based Similarity for Web Documents Clustering

Phrases have been considered a more informative feature term for improving the effectiveness of document clustering. In this paper, a weighted phrase-based document similarity is proposed to compute the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The weighted phrase-based document similarity is applied to the Group-average Hierarchical Agglomerat...

Suffix Tree Clustering on Post-retrieval Documents

Clustering is used to divide a collection of data into groups based on the similarity of objects. Document clustering has been studied in the context of information retrieval (IR). An IR system always returns a list of retrieved documents to the user. These post-retrieval documents can be clustered to help users browse and navigate the search results. For this purpose, Zamir and Etzio...

A semantics-based method for clustering of Chinese web search results

Information explosion is a critical challenge to the development of modern information systems. In particular, when an information system is deployed over the Internet, the amount of information on the web has been increasing exponentially. Search engines, such as Google and Baidu, are essential tools for people to find information on the Internet. Valuable informa...

A New Cluster Merging Algorithm of Suffix tree Clustering

Document clustering methods can be used to structure large sets of text or hypertext documents. Suffix Tree Clustering has been shown to be a good approach for document clustering. However, the cluster merging algorithm of Suffix Tree Clustering is based on the overlap of the clusters' document sets, which totally ignores the similarity between the non-overlapping parts of different clusters. In this pape...

Publication date: 2009